{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/integrations/openai/beyond_search_webinar/00_multi-extraction.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/integrations/openai/beyond_search_webinar/00_multi-extraction.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform data into larger chunks of around 500 tokens\n",
    "This step assumes that the rows of the dataset are ordered sequentially. If they are not, skip this part."
   ]
  },
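  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of how sequentially ordered rows could be merged into chunks of roughly 500 tokens. It is an illustration under stated assumptions, not the method used to build this dataset: `merge_sequential` and `count_words` are hypothetical names, and the default counter splits on whitespace as a stand-in for a real tokenizer.\n",
    "\n",
    "```python\n",
    "from typing import Callable, Iterable, List\n",
    "\n",
    "\n",
    "def count_words(text: str) -> int:\n",
    "    # Stand-in token counter (assumption): whitespace-separated words.\n",
    "    return len(text.split())\n",
    "\n",
    "\n",
    "def merge_sequential(texts: Iterable[str],\n",
    "                     max_tokens: int = 500,\n",
    "                     count_tokens: Callable[[str], int] = count_words) -> List[str]:\n",
    "    # Greedily pack consecutive texts into chunks of at most max_tokens.\n",
    "    chunks, current, current_len = [], [], 0\n",
    "    for text in texts:\n",
    "        n = count_tokens(text)\n",
    "        # Start a new chunk when adding this text would exceed the budget.\n",
    "        if current and current_len + n > max_tokens:\n",
    "            chunks.append(' '.join(current))\n",
    "            current, current_len = [], 0\n",
    "        current.append(text)\n",
    "        current_len += n\n",
    "    # Flush whatever is left in the final, possibly short, chunk.\n",
    "    if current:\n",
    "        chunks.append(' '.join(current))\n",
    "    return chunks\n",
    "```\n",
    "\n",
    "To count real tokens, pass the tokenizer created later in this notebook, for example `count_tokens=lambda t: len(tokenizer.encode(t))`. A single text that already exceeds `max_tokens` is kept as its own oversized chunk, and the per-text counts only approximate the token count of the joined chunk."
   ]
  },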
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"openai[embeddings]<1.0\" pandas transformers pyarrow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>docs</th>\n",
       "      <th>category</th>\n",
       "      <th>thread</th>\n",
       "      <th>href</th>\n",
       "      <th>question</th>\n",
       "      <th>context</th>\n",
       "      <th>marked</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3801</th>\n",
       "      <td>pytorch</td>\n",
       "      <td>Captum</td>\n",
       "      <td>GradCam - Zero output</td>\n",
       "      <td>https://discuss.pytorch.org/t/gradcam-zero-out...</td>\n",
       "      <td>Hello everyone,\\nI just started to dig into Ex...</td>\n",
       "      <td>Hi @Francesco_Grimaldi, this is a surprising r...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2766</th>\n",
       "      <td>pytorch</td>\n",
       "      <td>vision</td>\n",
       "      <td>Dynamic vs static set of images</td>\n",
       "      <td>https://discuss.pytorch.org/t/dynamic-vs-stati...</td>\n",
       "      <td>I have a problem where I want to create a netw...</td>\n",
       "      <td>You can train for a large set and pad shorter ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5802</th>\n",
       "      <td>streamlit</td>\n",
       "      <td>Streamlit Cloud</td>\n",
       "      <td>How to limit max upload size on share.streamli...</td>\n",
       "      <td>https://discuss.streamlit.io/t/how-to-limit-ma...</td>\n",
       "      <td>Hello\\nI am trying to connect PostgreSQL on He...</td>\n",
       "      <td>Hi @frances \\nStreamlit looks at your requirem...</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4157</th>\n",
       "      <td>tensorflow</td>\n",
       "      <td>General Discussion</td>\n",
       "      <td>Clear the graph and free the GPU memory in Ten...</td>\n",
       "      <td>https://discuss.tensorflow.org/t/clear-the-gra...</td>\n",
       "      <td>I’m training multiple models sequentially, whi...</td>\n",
       "      <td>We had another thread about this at:\\n\\n  \\n  ...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2385</th>\n",
       "      <td>huggingface</td>\n",
       "      <td>🤗Tokenizers</td>\n",
       "      <td>How to add additional custom pre-tokenization ...</td>\n",
       "      <td>https://discuss.huggingface.co/t/how-to-add-ad...</td>\n",
       "      <td>I would like to add a few custom functions for...</td>\n",
       "      <td>Hi @reSearch2vec\\nThere are multiple ways to c...</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             docs            category  \\\n",
       "3801      pytorch              Captum   \n",
       "2766      pytorch              vision   \n",
       "5802    streamlit     Streamlit Cloud   \n",
       "4157   tensorflow  General Discussion   \n",
       "2385  huggingface         🤗Tokenizers   \n",
       "\n",
       "                                                 thread  \\\n",
       "3801                              GradCam - Zero output   \n",
       "2766                    Dynamic vs static set of images   \n",
       "5802  How to limit max upload size on share.streamli...   \n",
       "4157  Clear the graph and free the GPU memory in Ten...   \n",
       "2385  How to add additional custom pre-tokenization ...   \n",
       "\n",
       "                                                   href  \\\n",
       "3801  https://discuss.pytorch.org/t/gradcam-zero-out...   \n",
       "2766  https://discuss.pytorch.org/t/dynamic-vs-stati...   \n",
       "5802  https://discuss.streamlit.io/t/how-to-limit-ma...   \n",
       "4157  https://discuss.tensorflow.org/t/clear-the-gra...   \n",
       "2385  https://discuss.huggingface.co/t/how-to-add-ad...   \n",
       "\n",
       "                                               question  \\\n",
       "3801  Hello everyone,\\nI just started to dig into Ex...   \n",
       "2766  I have a problem where I want to create a netw...   \n",
       "5802  Hello\\nI am trying to connect PostgreSQL on He...   \n",
       "4157  I’m training multiple models sequentially, whi...   \n",
       "2385  I would like to add a few custom functions for...   \n",
       "\n",
       "                                                context  marked  \n",
       "3801  Hi @Francesco_Grimaldi, this is a surprising r...       0  \n",
       "2766  You can train for a large set and pad shorter ...       0  \n",
       "5802  Hi @frances \\nStreamlit looks at your requirem...       1  \n",
       "4157  We had another thread about this at:\\n\\n  \\n  ...       0  \n",
       "2385  Hi @reSearch2vec\\nThere are multiple ways to c...       0  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import openai\n",
    "\n",
    "openai.api_key = '<<YOUR_API_KEY>>'  # beta.openai.com/login/\n",
    "\n",
    "# Load the four forum Q&A files and stack them into a single DataFrame.\n",
    "hf = pd.read_json('data/huggingface-qa.jsonl', lines=True)\n",
    "pt = pd.read_json('data/pytorch-qa.jsonl', lines=True)\n",
    "tf = pd.read_json('data/tensorflow-qa.jsonl', lines=True)\n",
    "sl = pd.read_json('data/streamlit-qa.jsonl', lines=True)\n",
    "df = pd.concat([hf, pt, tf, sl], ignore_index=True)\n",
    "# Preview five random rows.\n",
    "df.sample(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['streamlit', 'Using Streamlit',\n",
       "       '50MB dataset limitation when using Plotly.py',\n",
       "       'https://discuss.streamlit.io/t/50mb-dataset-limitation-when-using-plotly-py/9464',\n",
       "       'Hi there,\\nI’m using Plotly.py to show figures in my Streamlit app with big datasets. And now I get a notification of dataset oversizing, bigger than 50.0MB, is there anyway to solve this?',\n",
       "       'Thanks for the reply! I found this github link before I posted this issue, but it didn’t solve my problem.\\nBut later my friends and I found this error message in ~/site-packages/streamlit/server/server_util.py, and changed the parameter MESSAGE_LIMIT_SIZE to 200*1e6, making the write limit now 200 MB',\n",
       "       1], dtype=object)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.values[5080]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "huggingface    2765\n",
       "tensorflow     1224\n",
       "streamlit      1115\n",
       "pytorch        1061\n",
       "Name: docs, dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.docs.value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "88f3eac241a740328a08f0feb1da6a69",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f018b6c9293943b3b214941e596a8cad",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "90ea42460267407d8e154fbf0d2ec2f0",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ca92db898e24457abb91909239409a69",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Token indices sequence length is longer than the specified maximum sequence length for this model (1718 > 1024). Running this sequence through the model will result in indexing errors\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:>"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXcAAAD4CAYAAAAXUaZHAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAQjklEQVR4nO3dbYyl9VnH8e/lIrTu6NCWOsFd4iwZQtywiS2T0FVjZttahtpBbRq7G6qgtJNqMD40sbupL/CFaatiTFMUR4trFJki1pZdVrGi5wUJoexGLYvbtVSwDFRo1RydtcYuXr4495Tj9MzueX747/eTTDj3/9wPv5ndvbjnuu/zvyMzkSSV5ZtGHUCS1H8Wd0kqkMVdkgpkcZekAlncJalAF406AMBll12Ws7OzHW935swZtm/f3v9AQzCp2c09XOYerknLfeLEia9k5mtbvTcWxX12dpbjx493vF2tVmNhYaH/gYZgUrObe7jMPVyTljsi/nmr92zLSFKBRlrcI2IpIlbq9fooY0hScUZa3DPzSGYuT09PjzKGJBXHtowkFcjiLkkFsrhLUoEs7pJUIIu7JBVoLD7E1C+zBx/8+utnPvSDI0wiSaPlmbskFaioM/dmnsVLupB55i5JBbK4S1KBLO6SVCCLuyQVyOIuSQWa+Ltlmu+KkSQ19P3MPSK+KyLuioj7I+Kn+r1/SdL5tVXcI+LuiHgxIk5uGl+MiNMR8VREHATIzFOZ+V7gR4H5/keWJJ1Pu2fuh4HF5oGI2AbcCdwA7AYORMTu6r0bgUeAh/uWVJLUtsjM9laMmAWOZuY11fJe4PbMvL5aPgSQmR9s2ubBzGz58dCIWAaWAWZmZq5dXV3tOPz6+jpP118673p7dozfk57W19eZmpoadYyOmXu4zD1ck5Z73759JzKzZYeklwuqO4Bnm5bXgOsiYgF4O3AJcGyrjTNzBVgBmJ+fz26eOF6r1bjjkTPnXe+Zmzrf96BN2lPWN5h7uMw9XJOau5Veinu0GMvMrAG1HvYrSepRL3fLrAFXNC3vBJ7vZAcRsRQRK/V6vYcYkqTNeinujwNXRcSuiLgY2A880MkOMvNIZi5PT49fT1ySJlm7t0LeCzwKXB0RaxFxa2aeBW4DHgJOAfdl5pODiypJaldbPffMPLDF+DHOcdH0fCJiCViam5vrdheSpBZGOreMbRlJGgwnDpOkAlncJalAIy3u3gopSYNhz12SCmRbRpIKZHGXpALZc5ekAo30MXuZeQQ4Mj8//55BHqf5UXzPfKjlDMSSVBTbMpJUIIu7JBXI4i5JBfKCqiQVyA8xSVKBbMtIUoEs7pJUIIu7JBXI4i5JBfJuGUkqkHfLSFKBbMtIUoEs7pJUoJHOCjkKzhAp6ULgmbskFcjiLkkF8lZISSqQt0JKUoFsy0hSgSzuklQgi7skFcjiLkkFsrhLUoEs7pJUIIu7JBXogptbplnzPDPgXDOSyuGZuyQVyOkHJKlATj8gSQWyLSNJBbK4S1KBLO6SVCCLuyQV6IK+z30zn68qqRSeuUtSgSzuklQgi7skFcie+xbsv0uaZJ65S1KBLO6SVCCLuyQVyOIuSQWyuEtSgQZS3CPihyPidyPiUxHxlkEcQ5K0tbaLe0TcHREvRsTJTeOLEXE6Ip6KiIMAmfnJzHwPcAvwzr4mliSdVydn7oeBxeaBiNgG3AncAOwGDkTE7qZVfql6X5I0RJGZ7a8cMQsczcxrquW9wO2ZeX21fKha9UPV16cz86+22NcysAwwMzNz7erqasfh19fXebr+UsfbdWrPjv4/KWp9fZ2pqam+73fQzD1c5h6uScu9b9++E5k53+q9Xj+hugN4tml5DbgO+BngzcB0RMxl5l2bN8zMFWAFYH5+PhcWFjo+eK1W445HznQRuzPP3LTQ933WajW6+Z5HzdzDZe7hmtTcrfRa3KPFWGbmR4CP9LjvseFUBJImTa93y6wB
VzQt7wSeb3fjiFiKiJV6vd5jDElSs16L++PAVRGxKyIuBvYDD7S7cWYeyczl6en+97Ql6ULWya2Q9wKPAldHxFpE3JqZZ4HbgIeAU8B9mfnkYKJKktrVds89Mw9sMX4MONbNwSNiCViam5vrZnNJ0hZGOv2AbRlJGgznlpGkAvkkpg55W6SkSTDSM3dvhZSkwbDnLkkFsi3TA1s0ksaVF1QlqUD23CWpQPbcJalAtmUkqUAWd0kqkMVdkgrkBVVJKpAXVCWpQH6IqU/8QJOkcWLPXZIK5Jn7AHgWL2nUPHOXpAJ5t4wkFci7ZSSpQLZlJKlAFndJKpDFXZIKZHGXpAJZ3CWpQN4KKUkF8lZISSqQbRlJKpDFXZIKZHGXpAJZ3CWpQE75O2BO/ytpFDxzl6QCeeY+RM1n8YcXt48wiaTSWdzHgK0bSf1mW0aSCjTSM/eIWAKW5ubmRhljJJ54rs4tTWfsktRPTj8gSQWyLSNJBfKC6hjzQqukbnnmLkkFsrhLUoEs7pJUIIu7JBXI4i5JBbK4S1KBvBVyQnhbpKROeOYuSQWyuEtSgSzuklQge+5jZraNmSLtv0s6H8/cJalAfS/uEXFlRHwsIu7v974lSe1pq7hHxN0R8WJEnNw0vhgRpyPiqYg4CJCZ/5SZtw4irCSpPe2euR8GFpsHImIbcCdwA7AbOBARu/uaTpLUlcjM9laMmAWOZuY11fJe4PbMvL5aPgSQmR+slu/PzHecY3/LwDLAzMzMtaurqx2HX19f5+n6Sx1vNw5mXgkvfLX3/ezZMdynWK2vrzM1NTXUY/aDuYfL3MOxb9++E5k53+q9Xu6W2QE827S8BlwXEa8BfgV4XUQc2ij2m2XmCrACMD8/nwsLCx0HqNVq3PHImY63Gwfv23OWO57o/WalZ25a6D1MB2q1Gt38WY2auYfL3KPXS3WJFmOZmf8KvLeH/UqSetTL3TJrwBVNyzuB5zvZQUQsRcRKvV7vIYYkabNeivvjwFURsSsiLgb2Aw90soPMPJKZy9PTw+0bS1Lp2r0V8l7gUeDqiFiLiFsz8yxwG/AQcAq4LzOfHFxUSVK72uq5Z+aBLcaPAce6PXhELAFLc3Nz3e7igrfVdAVOSyBd2EY6/YBtGUkaDOeWkaQCWdwlqUAjnfLXnvvg9LMX37yvw4vbu84kaXjsuUtSgWzLSFKBLO6SVCB77hewYdwj7yMBpdGw5y5JBbItI0kFsrhLUoEs7pJUIC+oauQ6vejqRVrp/LygKkkFsi0jSQWyuEtSgSzuklQgi7skFWikxT0iliJipV6vjzKGJBXHu2UkqUC2ZSSpQBZ3SSqQxV2SCmRxl6QCWdwlqUBOHHaB2erpS8PWTg4nCJO6562QklQg2zKSVCCLuyQVyOIuSQWyuEtSgSzuklQgi7skFcjiLkkFsrhLUoEs7pJUIKcf0DcY5RQFgzj27MEHed+es9xy8MG2pjEYx2kPBp1pHL9n9cbpBySpQLZlJKlAFndJKpDFXZIKZHGXpAJZ3CWpQBZ3SSqQxV2SCmRxl6QCWdwlqUAWd0kqkMVdkgpkcZekAlncJalAFndJKlDf53OPiO3AbwH/A9Qy855+H0OSdG5tnblHxN0R8WJEnNw0vhgRpyPiqYg4WA2/Hbg/M98D3NjnvJKkNrTbljkMLDYPRMQ24E7gBmA3cCAidgM7gWer1V7qT0xJUiciM9tbMWIWOJqZ11TLe4HbM/P6avlQteoa8O+ZeTQiVjNz/xb7WwaWAWZmZq5dXV3tOPz6+jpP1yfz/x8zr4QXvjrqFJ3bNb2NqakpAJ54rt7Rtnt2vPzErU637dTmY7X6eXeap3n9djXvt9Pt28m91bHaWX+rbfux//X1daamptrappefUbf5trKRexD6kW+zffv2ncjM+Vbv9VLc3wEsZua7q+UfA64D3g98FPhv4JF2eu7z8/N5/PjxtnI0q9Vq3PIXZzrebhy8b89Z7nhipI+w7crh
xe0sLCwAnT/vtPnZnIN+TuvmY7X6eXeap5tni/bybNJ2cm91rHbW32rbfuy/VquxsLDQ1jb9en5rL9//ho3cg9CPfJtFxJbFvZfqEi3GMjPPAD/Rw34lST3q5VbINeCKpuWdwPOd7CAiliJipV4f7K/oknSh6aW4Pw5cFRG7IuJiYD/wQCc7yMwjmbk8Pd19z0mS9I3avRXyXuBR4OqIWIuIWzPzLHAb8BBwCrgvM58cXFRJUrva6rln5oEtxo8Bx7o9eEQsAUtzc3Pd7kKS1MJIpx+wLSNJg+HcMpJUIIu7JBWo7Q8xDeTgVc8deCfw+S52cRnwlb6GGp5JzW7u4TL3cE1a7u/MzNe2emOkxb1XEXF8q09njbtJzW7u4TL3cE1q7lZsy0hSgSzuklSgSS/uK6MO0INJzW7u4TL3cE1q7m8w0T13SVJrk37mLklqweIuSQWa2OK+xfNbR5nnioj4m4g4FRFPRsTPVuOvjohPR8Tnq/++qmmbQ1X+0xFxfdP4tRHxRPXeRyKi1dz5/cy+LSL+NiKOTkrm6piXRsT9EfG56ue+dxKyR8TPV39HTkbEvRHxinHM3erZyf3MGRGXRMTHq/HHqgcCDSr3r1V/Tz4bEX8WEZeOW+6+y8yJ+wK2AV8ArgQuBv4e2D3iTJcDr69efyvwjzSeLfurwMFq/CDw4er17ir3JcCu6vvZVr33GWAvjQei/Dlww4Cz/wLwxzSetMUkZK6O+QfAu6vXFwOXjnt2YAfwNPDKavk+4JZxzA18P/B64GTTWN9yAj8N3FW93g98fIC53wJcVL3+8Djm7vvftVEH6PIPby/wUNPyIeDQqHNtyvgp4AeA08Dl1djlwOlWmWlMnby3WudzTeMHgN8ZYM6dwMPAG3m5uI915uoY30ajSMam8bHOTqO4Pwu8msasrEerwjOWuYHZTUWybzk31qleX0Tjk6ExiNyb3vsR4J5xzN3Pr0lty2z8A9mwVo2NherXtNcBjwEzmfklgOq/316tttX3sKN6vXl8UH4T+EXgf5vGxj0zNH5r+zLw+1VL6fciYvu4Z8/M54BfB74IfAmoZ+ZfjnvuJv3M+fVtsvF8iDrwmoElf9lP0jgT/38ZNuUbx9wdmdTi3vL5rUNP0UJETAF/CvxcZv7HuVZtMZbnGO+7iHgb8GJmnmh3kxZjQ83c5CIav3r/dma+DjhDo02wlbHIXvWof4hGC+A7gO0R8a5zbdJibFQ/83PpJufQv4eI+ABwFrjnPBnGKnc3JrW49/z81kGIiG+mUdjvycxPVMMvRMTl1fuXAy9W41t9D2vV683jg/C9wI0R8QywCrwxIv5ozDNvWAPWMvOxavl+GsV+3LO/GXg6M7+cmV8DPgF8zwTk3tDPnF/fJiIuAqaBfxtU8Ii4GXgbcFNWPZVJyN2tSS3uPT+/td+qK+kfA05l5m80vfUAcHP1+mYavfiN8f3VlfddwFXAZ6pfdf8zIt5Q7fPHm7bpq8w8lJk7M3OWxs/wrzPzXeOcuSn7vwDPRsTV1dCbgH+YgOxfBN4QEd9SHe9NNB5TOe65N/QzZ/O+3kHj79+gfktdBN4P3JiZ/7Xp+xnb3D0ZddO/2y/grTTuSPkC8IExyPN9NH41+yzwd9XXW2n04h6mMaXxw8Crm7b5QJX/NE13OgDzwMnqvY8yhIs1wAIvX1CdlMzfDRyvfuafBF41CdmBXwY+Vx3zD2ncqTF2uYF7aVwX+BqNs9Vb+5kTeAXwJ8BTNO5MuXKAuZ+i0Sff+Ld517jl7veX0w9IUoEmtS0jSToHi7skFcjiLkkFsrhLUoEs7pJUIIu7JBXI4i5JBfo/+FVyZASkhXQAAAAASUVORK5CYII=",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
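   ],
   "source": [
    "from transformers import GPT2TokenizerFast\n",
    "\n",
    "\n",
    "def remove_newlines(series):\n",
    "    # Replace real newlines and literal backslash-n sequences with spaces.\n",
    "    series = series.str.replace('\\n', ' ', regex=False)\n",
    "    series = series.str.replace('\\\\n', ' ', regex=False)\n",
    "    # Two passes reduce runs of up to four spaces to a single space.\n",
    "    series = series.str.replace('  ', ' ', regex=False)\n",
    "    series = series.str.replace('  ', ' ', regex=False)\n",
    "    return series\n",
    "\n",
    "\n",
    "# Combine topic, question, and answer into one string per thread.\n",
    "df['text'] = \"Topic: \" + df.docs + \" - \" + df.category + \"; Question: \" + df.thread + \" - \" + df.question + \"; Answer: \" + df.context\n",
    "df['text'] = remove_newlines(df.text)\n",
    "\n",
    "# Count GPT-2 tokens per row. The tokenizer warns about inputs longer than 1024 tokens;\n",
    "# that is harmless here because the ids are only counted, never fed to the model.\n",
    "tokenizer = GPT2TokenizerFast.from_pretrained(\"gpt2\")\n",
    "df['n_tokens'] = df.text.apply(lambda x: len(tokenizer.encode(x)))\n",
    "\n",
    "# Histogram of token counts on a log-scaled count axis.\n",
    "df.n_tokens.hist(bins=100, log=True)"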
   ]
  },
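  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The whitespace cleanup above can also be written with a single regular expression. The sketch below is an alternative, not the notebook's own method: `normalize_whitespace` is a hypothetical helper that works on one plain string, and unlike `remove_newlines` it collapses every run of whitespace (tabs included) and strips leading and trailing spaces.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "\n",
    "def normalize_whitespace(text: str) -> str:\n",
    "    # Turn literal backslash-n sequences into spaces first.\n",
    "    text = text.replace('\\\\n', ' ')\n",
    "    # Collapse any run of whitespace, real newlines included, into one space.\n",
    "    return re.sub(r'\\s+', ' ', text).strip()\n",
    "```\n",
    "\n",
    "On the DataFrame it could be applied with `df.text.map(normalize_whitespace)`. Token counts may differ slightly from those produced by `remove_newlines` because more whitespace is removed."
   ]
  },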
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<AxesSubplot:>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAXGklEQVR4nO3df4xd5X3n8fdn7eIQHLCpm5HjsdbOykllQ7eNp17S7Lbj0NZugmIqFWmQKaalGi2lKe1SNXb5A/UPa90fpBuWhe4osDiFMnFdWqxEbuO63CIkwMVJGvwDl0ltmYkJTpaEct2Vi+l3/ziPyeHmjmfuz5nj5/OSru653/Occz7XP7733OeemauIwMzM8vDvZjuAmZn1j5u+mVlG3PTNzDLipm9mlhE3fTOzjMyf7QDTWbJkSaxYsaLl7c6cOcNll13W/UA95tz9V9Xszt1fVct98ODBb0fED33fioi44A14CDgNHGqofxI4BhwGfr9U3wZMpHUbSvW1wAtp3b2Apjt2RLB27dpox5NPPtnWdrPNufuvqtmdu7+qlht4Ppr01JlM7zwMbCwXJK0HNgE/EhFrgD9M9dXACLAmbXO/pHlpsweAUWBVur1jn2Zm1nvTNv2IeAp4raF8G7AjIs6mMadTfRMwHhFnI+I4xVn9OklLgcsj4pn0CvQ54PouPQczM5uhdj/I/QDwXyQ9J+nvJP14qi8DXi6Nm0y1ZWm5sW5mZn3U7ge584HFwDXAjwO7JL0fUJOxcYF6U5JGKaaCGBgYoFartRywXq+3td1sc+7+q2p25+6vquZu1G7TnwQeT1M1ByT9G7Ak1ZeXxg0Cp1J9sEm9qYgYA8YAhoaGYnh4uOWAtVqNdrabbc7df1XN7tz9VdXcjdqd3vlL4KMAkj4AXAJ8G9gDjEhaIGklxQe2ByLiFeANSddIEnAz8ESn4c3MrDXTnulLegwYBpZImgTupriM8yFJh4B/Bbaks/7DknYBR4BzwO0R8Vba1W0UVwJdCuxNNzMz66Npm35E3DjFqpumGL8d2N6k/jxwVUvpzMysq/xrGMzMMjLnfw1Dt6zY+sW3l0/s+PgsJjEzmz0+0zczy4ibvplZRtz0zcwy4qZvZpYRN30zs4y46ZuZZcRN38wsI276ZmYZcdM3M8uIm76ZWUbc9M3MMuKmb2aWETd9M7OMuOmbmWUkm1+tXOZfs2xmufKZvplZRqZt+pIeknQ6fR9u47rfkhSSlpRq2yRNSDomaUOpvlbSC2ndvekL0s3MrI9mcqb/MLCxsShpOfAzwMlSbTUwAqxJ29wvaV5a/QAwCqxKt+/bp5mZ9da0TT8ingJea7Lqj4DfBqJU2wSMR8TZiDgOTADrJC0FLo+IZyIigM8B13ca3szMWtPWB7mSPgF8IyL+oWGWZhnwbOnxZKq9mZYb61Ptf5TiXQEDAwPUarWWM9br9Xdsd+fV55qOa2ffvdSYuyqqmhuqm925+6uquRu13PQlvRu4C/jZZqub1OIC9aYiYgwYAxgaGorh4eFWY1Kr1Shvd0vpip2yE5tb33cvNeauiqrmhupmd+7+qmruRu2c6f8HYCVw/ix/EPiypHUUZ/DLS2MHgVOpPtikbmZmfdTyJZsR8UJEvDciVkTECoqG/qGI+CawBxiRtEDSSooPbA9ExCvAG5KuSVft3Aw80b2nYWZmMzGTSzYfA54BPihpUtKtU42NiMPALuAI8FfA7RHxVlp9G/BZig93vw7s7TC7mZm1aNrpnYi4cZr1Kxoebwe2Nxn3PHBVi/nMzKyL/BO5ZmYZcdM3M8uIm76ZWUbc9M3MMuKmb2aWETd9M7OMuOmbmWXETd/MLCNu+mZmGXHTNzPLiJu+mVlG3PTNzDLipm9mlhE3fTOzjLjpm5llxE3fzCwjbvpmZhlx0zczy8hMviP3IUmnJR0q1f5A0ouSvibpLyQtKq3bJmlC0jFJG0r1tZJeSOvuTV+Q
bmZmfTSTM/2HgY0NtX3AVRHxI8A/AtsAJK0GRoA1aZv7Jc1L2zwAjAKr0q1xn2Zm1mPTNv2IeAp4raH2pYg4lx4+Cwym5U3AeEScjYjjwASwTtJS4PKIeCYiAvgccH2XnoOZmc3Q/C7s45eBz6flZRQvAudNptqbabmx3pSkUYp3BQwMDFCr1VoOVa/X37HdnVefazqunX33UmPuqqhqbqhudufur6rmbtRR05d0F3AOePR8qcmwuEC9qYgYA8YAhoaGYnh4uOVstVqN8na3bP1i03EnNre+715qzF0VVc0N1c3u3P1V1dyN2m76krYA1wHXpikbKM7gl5eGDQKnUn2wSd3MzPqoraYvaSPwKeCnIuJfSqv2AH8q6dPA+yg+sD0QEW9JekPSNcBzwM3A/+ws+vRWTHF2b2aWq2mbvqTHgGFgiaRJ4G6Kq3UWAPvSlZfPRsR/jYjDknYBRyimfW6PiLfSrm6juBLoUmBvupmZWR9N2/Qj4sYm5QcvMH47sL1J/XngqpbSmZlZV/kncs3MMuKmb2aWETd9M7OMuOmbmWXETd/MLCNu+mZmGXHTNzPLiJu+mVlG3PTNzDLipm9mlhE3fTOzjLjpm5llxE3fzCwjbvpmZhlx0zczy4ibvplZRtz0zcwyMm3Tl/SQpNOSDpVqV0raJ+mldL+4tG6bpAlJxyRtKNXXSnohrbtX6XsWzcysf2Zypv8wsLGhthXYHxGrgP3pMZJWAyPAmrTN/ZLmpW0eAEYpvix9VZN9mplZj03b9CPiKeC1hvImYGda3glcX6qPR8TZiDgOTADrJC0FLo+IZyIigM+VtjEzsz5pd05/ICJeAUj37031ZcDLpXGTqbYsLTfWzcysj+Z3eX/N5unjAvXmO5FGKaaCGBgYoFartRykXq9z59VvTTuunX33Ur1en3OZZqKquaG62Z27v6qau1G7Tf9VSUsj4pU0dXM61SeB5aVxg8CpVB9sUm8qIsaAMYChoaEYHh5uOWCtVuOep89MO+7E5tb33Uu1Wo12nu9sq2puqG525+6vquZu1O70zh5gS1reAjxRqo9IWiBpJcUHtgfSFNAbkq5JV+3cXNrGzMz6ZNozfUmPAcPAEkmTwN3ADmCXpFuBk8ANABFxWNIu4AhwDrg9Is7PsdxGcSXQpcDedDMzsz6atulHxI1TrLp2ivHbge1N6s8DV7WUzszMuso/kWtmlhE3fTOzjLjpm5llxE3fzCwjbvpmZhlx0zczy4ibvplZRtz0zcwy4qZvZpYRN30zs4y46ZuZZcRN38wsI276ZmYZcdM3M8uIm76ZWUbc9M3MMuKmb2aWETd9M7OMdNT0Jf2mpMOSDkl6TNK7JF0paZ+kl9L94tL4bZImJB2TtKHz+GZm1oq2m76kZcCvA0MRcRUwDxgBtgL7I2IVsD89RtLqtH4NsBG4X9K8zuKbmVkrOp3emQ9cKmk+8G7gFLAJ2JnW7wSuT8ubgPGIOBsRx4EJYF2HxzczsxYoItrfWLoD2A78P+BLEbFZ0ncjYlFpzHciYrGk+4BnI+KRVH8Q2BsRu5vsdxQYBRgYGFg7Pj7ecrZ6vc7x19+adtzVy65oed+9VK/XWbhw4WzHaFlVc0N1szt3f1Ut9/r16w9GxFBjfX67O0xz9ZuAlcB3gT+TdNOFNmlSa/qKExFjwBjA0NBQDA8Pt5yvVqtxz9Nnph13YnPr++6lWq1GO893tlU1N1Q3u3P3V1VzN+pkeuengeMR8a2IeBN4HPgJ4FVJSwHS/ek0fhJYXtp+kGI6yMzM+qSTpn8SuEbSuyUJuBY4CuwBtqQxW4An0vIeYETSAkkrgVXAgQ6Ob2ZmLWp7eicinpO0G/gycA74CsWUzEJgl6RbKV4YbkjjD0vaBRxJ42+PiOkn3c3MrGvabvoAEXE3cHdD+SzFWX+z8dspPvg1M7NZ4J/INTPLiJu+mVlG3PTNzDLipm9mlhE3fTOzjLjpm5llxE3fzCwjbvpm
Zhlx0zczy4ibvplZRtz0zcwy4qZvZpYRN30zs4y46ZuZZcRN38wsI276ZmYZcdM3M8uIm76ZWUY6avqSFknaLelFSUclfVjSlZL2SXop3S8ujd8maULSMUkbOo9vZmat6PRM/zPAX0XEDwP/ETgKbAX2R8QqYH96jKTVwAiwBtgI3C9pXofHNzOzFrTd9CVdDvwk8CBARPxrRHwX2ATsTMN2Aten5U3AeEScjYjjwASwrt3jm5lZ6xQR7W0o/SgwBhyhOMs/CNwBfCMiFpXGfSciFku6D3g2Ih5J9QeBvRGxu8m+R4FRgIGBgbXj4+Mt56vX6xx//a1px1297IqW991L9XqdhQsXznaMllU1N1Q3u3P3V9Vyr1+//mBEDDXW53ewz/nAh4BPRsRzkj5DmsqZgprUmr7iRMQYxQsKQ0NDMTw83HK4Wq3GPU+fmXbcic2t77uXarUa7Tzf2VbV3FDd7M7dX1XN3aiTOf1JYDIinkuPd1O8CLwqaSlAuj9dGr+8tP0gcKqD45uZWYvabvoR8U3gZUkfTKVrKaZ69gBbUm0L8ERa3gOMSFogaSWwCjjQ7vHNzKx1nUzvAHwSeFTSJcA/Ab9E8UKyS9KtwEngBoCIOCxpF8ULwzng9oiYftK9x1Zs/eLbyyd2fHwWk5iZ9V5HTT8ivgp83wcFFGf9zcZvB7Z3ckwzM2uffyLXzCwjbvpmZhlx0zczy4ibvplZRtz0zcwy4qZvZpYRN30zs4y46ZuZZcRN38wsI276ZmYZcdM3M8uIm76ZWUbc9M3MMuKmb2aWETd9M7OMuOmbmWXETd/MLCMdN31J8yR9RdIX0uMrJe2T9FK6X1wau03ShKRjkjZ0emwzM2tNN8707wCOlh5vBfZHxCpgf3qMpNXACLAG2AjcL2leF45vZmYz1FHTlzQIfBz4bKm8CdiZlncC15fq4xFxNiKOAxPAuk6Ob2ZmrVFEtL+xtBv478B7gN+KiOskfTciFpXGfCciFku6D3g2Ih5J9QeBvRGxu8l+R4FRgIGBgbXj4+MtZ6vX6xx//a2Wtrl62RUtH6fb6vU6CxcunO0YLatqbqhudufur6rlXr9+/cGIGGqsz293h5KuA05HxEFJwzPZpEmt6StORIwBYwBDQ0MxPDyT3b9TrVbjnqfPtLTNic2tH6fbarUa7Tzf2VbV3FDd7M7dX1XN3ajtpg98BPiEpI8B7wIul/QI8KqkpRHxiqSlwOk0fhJYXtp+EDjVwfHNzKxFbc/pR8S2iBiMiBUUH9D+bUTcBOwBtqRhW4An0vIeYETSAkkrgVXAgbaTm5lZyzo505/KDmCXpFuBk8ANABFxWNIu4AhwDrg9IlqbdDczs450pelHRA2opeX/C1w7xbjtwPZuHNPMzFrnn8g1M8uIm76ZWUZ6MadfWSu2fvHt5RM7Pj6LSczMesNn+mZmGXHTNzPLiJu+mVlG3PTNzDLipm9mlhE3fTOzjLjpm5llxE3fzCwjbvpmZhlx0zczy4ibvplZRtz0zcwy4qZvZpYRN30zs4y46ZuZZaTtpi9puaQnJR2VdFjSHal+paR9kl5K94tL22yTNCHpmKQN3XgCZmY2c518ico54M6I+LKk9wAHJe0DbgH2R8QOSVuBrcCnJK0GRoA1wPuAv5H0gbn65ej+QhUzuxi1faYfEa9ExJfT8hvAUWAZsAnYmYbtBK5Py5uA8Yg4GxHHgQlgXbvHNzOz1ikiOt+JtAJ4CrgKOBkRi0rrvhMRiyXdBzwbEY+k+oPA3ojY3WR/o8AowMDAwNrx8fGWM9XrdY6/3p03EVcvu6Ir+5mJer3OwoUL+3a8bqlqbqhudufur6rlXr9+/cGIGGqsd/wduZIWAn8O/EZE/LOkKYc2qTV9xYmIMWAMYGhoKIaHh1vOVavVuOfpMy1v18yJza0fv121Wo12nu9sq2puqG525+6vquZu1NHVO5J+gKLhPxoRj6fyq5KWpvVLgdOpPgksL20+
CJzq5PhmZtaaTq7eEfAgcDQiPl1atQfYkpa3AE+U6iOSFkhaCawCDrR7fDMza10n0zsfAX4ReEHSV1Ptd4AdwC5JtwIngRsAIuKwpF3AEYorf26fq1fumJldrNpu+hHxNM3n6QGunWKb7cD2do85W3z5ppldLPwTuWZmGXHTNzPLSMeXbObGUz1mVmU+0zczy4ibvplZRtz0zcwy4qZvZpYRN30zs4y46ZuZZcSXbHbAl2+aWdX4TN/MLCM+0+8Sn/WbWRW46feAXwDMbK7y9I6ZWUbc9M3MMuLpnR4rT/WUedrHzGaDz/TNzDLipm9mlpG+T+9I2gh8BpgHfDYidvQ7w1ww1bTPnVef45YZTAn5CiEza0dfm76kecD/An4GmAT+XtKeiDjSzxxVNdULhT83MLOZ6veZ/jpgIiL+CUDSOLAJcNPvgaleDDox1buNhzde1tJ+pnqncqHMM3mn43dAZhemiOjfwaRfADZGxK+kx78I/KeI+LWGcaPAaHr4QeBYG4dbAny7g7izxbn7r6rZnbu/qpb730fEDzUW+32mrya173vViYgxYKyjA0nPR8RQJ/uYDc7df1XN7tz9VdXcjfp99c4ksLz0eBA41ecMZmbZ6nfT/3tglaSVki4BRoA9fc5gZpatvk7vRMQ5Sb8G/DXFJZsPRcThHh2uo+mhWeTc/VfV7M7dX1XN/Q59/SDXzMxml38i18wsI276ZmYZuSibvqSNko5JmpC0dZazLJf0pKSjkg5LuiPVr5S0T9JL6X5xaZttKfsxSRtK9bWSXkjr7pXU7BLYbuefJ+krkr5QsdyLJO2W9GL6s/9wFbJL+s307+SQpMckvWsu5pb0kKTTkg6Val3LKWmBpM+n+nOSVvQw9x+kfydfk/QXkhbNtdxdFREX1Y3iA+KvA+8HLgH+AVg9i3mWAh9Ky+8B/hFYDfw+sDXVtwK/l5ZXp8wLgJXpucxL6w4AH6b4eYe9wM/1If9/A/4U+EJ6XJXcO4FfScuXAIvmenZgGXAcuDQ93gXcMhdzAz8JfAg4VKp1LSfwq8Afp+UR4PM9zP2zwPy0/HtzMXdX/+5mO0DXn1DxF/HXpcfbgG2znauU5wmK3z10DFiaakuBY83yUlzp9OE05sVS/Ubgf/c46yCwH/go32v6Vch9OUXzVEN9TmenaPovA1dSXFn3hdSQ5mRuYEVD8+xazvNj0vJ8ip+EVS9yN6z7eeDRuZi7W7eLcXrn/H+c8yZTbdalt3o/BjwHDETEKwDp/r1p2FT5l6Xlxnov/Q/gt4F/K9WqkPv9wLeA/5Ompj4r6bK5nj0ivgH8IXASeAV4PSK+NNdzl3Qz59vbRMQ54HXgB3uW/Ht+meLM/R0ZGvLNxdwzdjE2/Rn9qod+k7QQ+HPgNyLiny80tEktLlDvCUnXAacj4uBMN2lS63vuZD7FW/gHIuLHgDMU0w1TmRPZ0xz4JoqphPcBl0m66UKbNKnN1p/5hbSTs+/PQdJdwDng0WkyzKncrboYm/6c+1UPkn6AouE/GhGPp/Krkpam9UuB06k+Vf7JtNxY75WPAJ+QdAIYBz4q6ZEK5D6fZTIinkuPd1O8CMz17D8NHI+Ib0XEm8DjwE9UIPd53cz59jaS5gNXAK/1KrikLcB1wOZIczNVyN2Oi7Hpz6lf9ZA+1X8QOBoRny6t2gNsSctbKOb6z9dH0lUAK4FVwIH0dvkNSdekfd5c2qbrImJbRAxGxAqKP8O/jYib5nrulP2bwMuSPphK11L8+u65nv0kcI2kd6fjXQscrUDu87qZs7yvX6D499eTM2YVX+z0KeATEfEvDc9nzuZu22x/qNCLG/Axiqtkvg7cNctZ/jPF27uvAV9Nt49RzPPtB15K91eWtrkrZT9G6aoLYAg4lNbdR58+IAKG+d4HuZXIDfwo8Hz6c/9LYHEVsgO/C7yYjvknFFeOzLncwGMUnzu8SXF2e2s3cwLvAv4MmKC4Uub9Pcw9QTEPf/7/
5x/PtdzdvPnXMJiZZeRinN4xM7MpuOmbmWXETd/MLCNu+mZmGXHTNzPLiJu+mVlG3PTNzDLy/wFxsE26s/PONwAAAABJRU5ErkJggg==",
      "text/plain": [
       "<Figure size 432x288 with 1 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "df.n_tokens.hist(bins=100)"
   ]
  },
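  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Keep only rows shorter than 2,000 tokens. The .copy() avoids a pandas\n",
    "# SettingWithCopyWarning when the embeddings column is added later.\n",
    "df = df[df.n_tokens < 2000].copy()"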
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get embeddings and save them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>docs</th>\n",
       "      <th>category</th>\n",
       "      <th>thread</th>\n",
       "      <th>href</th>\n",
       "      <th>question</th>\n",
       "      <th>context</th>\n",
       "      <th>marked</th>\n",
       "      <th>text</th>\n",
       "      <th>n_tokens</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>huggingface</td>\n",
       "      <td>Beginners</td>\n",
       "      <td>Can’t download (some) models although they are...</td>\n",
       "      <td>https://discuss.huggingface.co/t/cant-download...</td>\n",
       "      <td>Can’t download (some) models to pytorch, altho...</td>\n",
       "      <td>Looking at umarayub/t5-small-finetuned-xsum at...</td>\n",
       "      <td>0</td>\n",
       "      <td>Topic: huggingface - Beginners; Question: Can’...</td>\n",
       "      <td>550</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>huggingface</td>\n",
       "      <td>Beginners</td>\n",
       "      <td>Trainer.push_to_hub is taking lot of time, is ...</td>\n",
       "      <td>https://discuss.huggingface.co/t/trainer-push-...</td>\n",
       "      <td>Hi, I’m trying to push my model to HF hub via ...</td>\n",
       "      <td>@sgugger  can you please help me out with this...</td>\n",
       "      <td>0</td>\n",
       "      <td>Topic: huggingface - Beginners; Question: Trai...</td>\n",
       "      <td>204</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>huggingface</td>\n",
       "      <td>Beginners</td>\n",
       "      <td>SSLCertVerificationError when loading a model</td>\n",
       "      <td>https://discuss.huggingface.co/t/sslcertverifi...</td>\n",
       "      <td>I am exploring potential opportunities of usin...</td>\n",
       "      <td>I’m also getting the same error. Please let me...</td>\n",
       "      <td>0</td>\n",
       "      <td>Topic: huggingface - Beginners; Question: SSLC...</td>\n",
       "      <td>494</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>huggingface</td>\n",
       "      <td>Beginners</td>\n",
       "      <td>How to use embeddings to compute similarity?</td>\n",
       "      <td>https://discuss.huggingface.co/t/how-to-use-em...</td>\n",
       "      <td>Hi, I would like to compute sentence similarit...</td>\n",
       "      <td>With transformers, the feature-extraction pipe...</td>\n",
       "      <td>0</td>\n",
       "      <td>Topic: huggingface - Beginners; Question: How ...</td>\n",
       "      <td>351</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>huggingface</td>\n",
       "      <td>Beginners</td>\n",
       "      <td>How to use additional input features for NER?</td>\n",
       "      <td>https://discuss.huggingface.co/t/how-to-use-ad...</td>\n",
       "      <td>Hello,\\nI’ve been following the documentation ...</td>\n",
       "      <td>mhl:\\n\\ne.g [“Arizona_NNP”, “Ice_NNP”, “Tea_NN...</td>\n",
       "      <td>0</td>\n",
       "      <td>Topic: huggingface - Beginners; Question: How ...</td>\n",
       "      <td>1718</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          docs   category                                             thread  \\\n",
       "0  huggingface  Beginners  Can’t download (some) models although they are...   \n",
       "1  huggingface  Beginners  Trainer.push_to_hub is taking lot of time, is ...   \n",
       "2  huggingface  Beginners      SSLCertVerificationError when loading a model   \n",
       "3  huggingface  Beginners       How to use embeddings to compute similarity?   \n",
       "4  huggingface  Beginners      How to use additional input features for NER?   \n",
       "\n",
       "                                                href  \\\n",
       "0  https://discuss.huggingface.co/t/cant-download...   \n",
       "1  https://discuss.huggingface.co/t/trainer-push-...   \n",
       "2  https://discuss.huggingface.co/t/sslcertverifi...   \n",
       "3  https://discuss.huggingface.co/t/how-to-use-em...   \n",
       "4  https://discuss.huggingface.co/t/how-to-use-ad...   \n",
       "\n",
       "                                            question  \\\n",
       "0  Can’t download (some) models to pytorch, altho...   \n",
       "1  Hi, I’m trying to push my model to HF hub via ...   \n",
       "2  I am exploring potential opportunities of usin...   \n",
       "3  Hi, I would like to compute sentence similarit...   \n",
       "4  Hello,\\nI’ve been following the documentation ...   \n",
       "\n",
       "                                             context  marked  \\\n",
       "0  Looking at umarayub/t5-small-finetuned-xsum at...       0   \n",
       "1  @sgugger  can you please help me out with this...       0   \n",
       "2  I’m also getting the same error. Please let me...       0   \n",
       "3  With transformers, the feature-extraction pipe...       0   \n",
       "4  mhl:\\n\\ne.g [“Arizona_NNP”, “Ice_NNP”, “Tea_NN...       0   \n",
       "\n",
       "                                                text  n_tokens  \n",
       "0  Topic: huggingface - Beginners; Question: Can’...       550  \n",
       "1  Topic: huggingface - Beginners; Question: Trai...       204  \n",
       "2  Topic: huggingface - Beginners; Question: SSLC...       494  \n",
       "3  Topic: huggingface - Beginners; Question: How ...       351  \n",
       "4  Topic: huggingface - Beginners; Question: How ...      1718  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
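   ],
   "source": [
    "from openai.embeddings_utils import get_embedding\n",
    "\n",
    "size = 'curie'  # model size, used in the engine name and the output filename\n",
    "\n",
    "# Embed each row's text with the document-side search engine (one API call per row).\n",
    "df['embeddings'] = df.text.apply(lambda x: get_embedding(x, engine=f'text-search-{size}-doc-001'))\n",
    "# Save the frame with its embeddings so the API calls need not be repeated.\n",
    "df.to_parquet(f'data/{size}_embeddings.parquet')\n",
    "df.head()"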
   ]
  },
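  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Embedding several thousand rows one request at a time can fail partway through on a transient error. The sketch below shows one possible way to retry with exponential backoff. It is an assumption-laden illustration rather than part of the original workflow: `embed_with_retry`, `max_retries`, and `base_delay` are hypothetical names, and the embedding call is passed in as a plain callable so the sketch does not depend on any particular client API.\n",
    "\n",
    "```python\n",
    "import time\n",
    "from typing import Callable, List\n",
    "\n",
    "\n",
    "def embed_with_retry(embed_fn: Callable[[str], List[float]],\n",
    "                     text: str,\n",
    "                     max_retries: int = 3,\n",
    "                     base_delay: float = 1.0) -> List[float]:\n",
    "    # Call embed_fn(text), retrying failed calls with exponential backoff.\n",
    "    if max_retries < 1:\n",
    "        raise ValueError('max_retries must be at least 1')\n",
    "    for attempt in range(max_retries):\n",
    "        try:\n",
    "            return embed_fn(text)\n",
    "        except Exception:\n",
    "            # Re-raise once the last allowed attempt has failed.\n",
    "            if attempt == max_retries - 1:\n",
    "                raise\n",
    "            # Wait base_delay, then 2x, 4x, ... before the next attempt.\n",
    "            time.sleep(base_delay * 2 ** attempt)\n",
    "```\n",
    "\n",
    "In this notebook the wrapped callable could be `lambda t: get_embedding(t, engine=f'text-search-{size}-doc-001')`. Catching every `Exception` keeps the sketch short; in practice it is safer to catch only the client's transient errors."
   ]
  },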
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13 (default, Mar 28 2022, 06:59:08) [MSC v.1916 64 bit (AMD64)]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5fe10bf018ef3e697f9035d60bf60847932a12bface18908407fd371fe880db9"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
